10 research outputs found
An Efficient Thread Mapping Strategy for Multiprogramming on Manycore Processors
The emergence of multicore and manycore processors is set to change the
parallel computing world. Applications are shifting towards increased
parallelism in order to utilise these architectures efficiently. This leads to
a situation where every application creates its desirable number of threads,
based on its parallel nature and the system resources allowance. Task
scheduling in such a multithreaded multiprogramming environment is a
significant challenge. In task scheduling, not only the order of the execution,
but also the mapping of threads to the execution resources is of a great
importance. In this paper we state and discuss some fundamental rules based on
results obtained from selected applications of the BOTS benchmarks on the
64-core TILEPro64 processor. We demonstrate how previously efficient mapping
policies such as those of the SMP Linux scheduler become inefficient when the
number of threads and cores grows. We propose a novel, low-overhead technique,
a heuristic based on the amount of time spent by each CPU doing some useful
work, to fairly distribute the workloads amongst the cores in a
multiprogramming environment. Our novel approach could be implemented as a
pragma similar to those in the new task-based OpenMP versions, or can be
incorporated as a distributed thread mapping mechanism in future manycore
programming frameworks. We show that our thread mapping scheme can outperform
the native GNU/Linux thread scheduler in both single-programming and
multiprogramming environments.Comment: ParCo Conference, Munich, Germany, 201
Cache-aware Parallel Programming for Manycore Processors
With rapidly evolving technology, multicore and manycore processors have
emerged as promising architectures to benefit from increasing transistor
numbers. The transition towards these parallel architectures makes today an
exciting time to investigate challenges in parallel computing. The TILEPro64 is
a manycore accelerator, composed of 64 tiles interconnected via multiple 8x8
mesh networks. It contains per-tile caches and supports cache-coherent shared
memory by default. In this paper we present a programming technique to take
advantages of distributed caching facilities in manycore processors. However,
unlike other work in this area, our approach does not use architecture-specific
libraries. Instead, we provide the programmer with a novel technique on how to
program future Non-Uniform Cache Architecture (NUCA) manycore systems, bearing
in mind their caching organisation. We show that our localised programming
approach can result in a significant improvement of the parallelisation
efficiency (speed-up).Comment: This work was presented at the international symposium on Highly-
Efficient Accelerators and Reconfigurable Technologies (HEART2013),
Edinburgh, Scotland, June 13-14, 201
The Glasgow Parallel Reduction Machine: Programming Shared-memory Many-core Systems using Parallel Task Composition
We present the Glasgow Parallel Reduction Machine (GPRM), a novel, flexible
framework for parallel task-composition based many-core programming. We allow
the programmer to structure programs into task code, written as C++ classes,
and communication code, written in a restricted subset of C++ with functional
semantics and parallel evaluation. In this paper we discuss the GPRM, the
virtual machine framework that enables the parallel task composition approach.
We focus the discussion on GPIR, the functional language used as the
intermediate representation of the bytecode running on the GPRM. Using examples
in this language we show the flexibility and power of our task composition
framework. We demonstrate the potential using an implementation of a merge sort
algorithm on a 64-core Tilera processor, as well as on a conventional Intel
quad-core processor and an AMD 48-core processor system. We also compare our
framework with OpenMP tasks in a parallel pointer chasing algorithm running on
the Tilera processor. Our results show that the GPRM programs outperform the
corresponding OpenMP codes on all test platforms, and can greatly facilitate
writing of parallel programs, in particular non-data parallel algorithms such
as reductions.Comment: In Proceedings PLACES 2013, arXiv:1312.221
GPRM: a high performance programming framework for manycore processors
Processors with large numbers of cores are becoming commonplace. In order to utilise the
available resources in such systems, the programming paradigm has to move towards increased parallelism. However, increased parallelism does not necessarily lead to better performance. Parallel programming models have to provide not only flexible ways of defining
parallel tasks, but also efficient methods to manage the created tasks. Moreover, in a general-purpose system, applications residing in the system compete for the shared resources. Thread
and task scheduling in such a multiprogrammed multithreaded environment is a significant challenge.
In this thesis, we introduce a new task-based parallel reduction model, called the Glasgow Parallel Reduction Machine (GPRM). Our main objective is to provide high performance while maintaining ease of programming. GPRM supports native parallelism; it provides a modular way of expressing parallel tasks and the communication patterns between them. Compiling a GPRM program results in an Intermediate Representation (IR) containing useful information about tasks, their dependencies, as well as the initial mapping information. This compile-time information helps reduce the overhead of runtime task scheduling and is key to high performance. Generally speaking, the granularity and the number of tasks are major factors in achieving high performance. These factors are even more important in the case of GPRM, as it is highly dependent on tasks, rather than threads.
We use three basic benchmarks to provide a detailed comparison of GPRM with Intel OpenMP, Cilk Plus, and Threading Building Blocks (TBB) on the Intel Xeon Phi, and with GNU OpenMP on the Tilera TILEPro64. GPRM shows superior performance in almost all cases, only by controlling the number of tasks. GPRM also provides a low-overhead mechanism, called “Global Sharing”, which improves performance in multiprogramming situations.
We use OpenMP, as the most popular model for shared-memory parallel programming as the main GPRM competitor for solving three well-known problems on both platforms: LU factorisation of Sparse Matrices, Image Convolution, and Linked List Processing. We focus on proposing solutions that best fit into the GPRM’s model of execution. GPRM outperforms OpenMP in all cases on the TILEPro64. On the Xeon Phi, our solution for the LU Factorisation results in notable performance improvement for sparse matrices with large numbers of small blocks. We investigate the overhead of GPRM’s task creation and distribution for very short computations using the Image Convolution benchmark. We show that this overhead can be mitigated by combining smaller tasks into larger ones. As a result, GPRM can outperform OpenMP for convolving large 2D matrices on the Xeon Phi. Finally, we demonstrate that our parallel worksharing construct provides an efficient solution for Linked List processing and performs better than OpenMP implementations on the Xeon Phi.
The results are very promising, as they verify that our parallel programming framework for manycore processors is flexible and scalable, and can provide high performance without
sacrificing productivity
Arm Mbed – AWS IoT System Integration [Open access]
This project explores the different Internet of Things (IoT) architectures and the available platforms
to define a general IoT Architecture to connect Arm microcontrollers to Amazon Web Services. In
order to accommodate the wide range of IoT applications, the architecture was defined with different
routes that an Arm microcontroller can take to reach AWS. Once this Architecture was defined, a
performance analysis on the different routes was performed in terms of communication speed and
bandwidth. Finally, a Smart Home use case scenario is implemented to show the basic functionalities
of an IoT system such as sending data to the device and data storage in the Cloud. Furthermore, a
Cloud ML algorithm is triggered in real time by the Smart Home to receive a prediction of the current
Comfort Level in the room
Efficient Parallel Linked List Processing
OpenMP is a very popular and successful parallel programming API, but efficient parallel traversal of a list (of possibly unknown size) of items linked by pointers is a challenging task: solving the problem with OpenMP worksharing constructs requires either transforming the list into an array for the traversal or for all threads to traverse each of the elements and compete to execute them. Both techniques are inefficient. OpenMP 3.0 allows to addresses the problem using pointer chasing by a master thread and creating a task for each element of the list. These tasks can be processed by any thread in the team. In this study, we propose a more efficient cutoff-based linked list traversal using our task-based parallel programming model, GPRM. We compare the performance of this technique in both GPRM and OpenMP implementations with the conventional OpenMP implementation, which we call Task-Per-Element (TPE)
Steal locally, share globally
In a general-purpose computing system, several parallel applications run simultaneously on the same platform. Even if each application is highly tuned for that specific platform, additional performance issues are arising in such a dynamic environment in which multiple applications compete for the resources. Different scheduling and resource management techniques have been proposed either at operating system or user level to improve the performance of concurrent workloads. In this paper, we propose a task-based strategy called “Steal Locally, Share Globally” implemented in the runtime of our parallel programming model GPRM (Glasgow Parallel Reduction Machine). We have chosen a state-of-the-art manycore parallel machine, the Intel Xeon Phi, to compare GPRM with some well-known parallel programming models, OpenMP, Intel Cilk Plus and Intel TBB, in both single-programming and multiprogramming scenarios. We show that GPRM not only performs well for single workloads, but also outperforms the other models for multiprogramming workloads. There are three considerations regarding our task-based scheme: (i) It is implemented inside the parallel framework, not as a separate layer; (ii) It improves the performance without the need to change the number of threads for each application (iii) It can be further tuned and improved, not only for the GPRM applications, but for other equivalent parallel programming models
Arm Mbed – AWS IoT System Integration [Open access]
This project explores the different Internet of Things (IoT) architectures and the available platforms
to define a general IoT Architecture to connect Arm microcontrollers to Amazon Web Services. In
order to accommodate the wide range of IoT applications, the architecture was defined with different
routes that an Arm microcontroller can take to reach AWS. Once this Architecture was defined, a
performance analysis on the different routes was performed in terms of communication speed and
bandwidth. Finally, a Smart Home use case scenario is implemented to show the basic functionalities
of an IoT system such as sending data to the device and data storage in the Cloud. Furthermore, a
Cloud ML algorithm is triggered in real time by the Smart Home to receive a prediction of the current
Comfort Level in the room